LA-UR-06-3445 






Automatic Metadata Generation using Associative Networks* 

Marko A. Rodriguez, Johan Bollen, and Herbert Van de Sompel

Digital Library Research and Prototyping Team 

Los Alamos National Laboratory 

Los Alamos, New Mexico 87545 

In spite of its tremendous value, metadata is generally sparse and incomplete, thereby hampering 
the effectiveness of digital information services. Many of the existing mechanisms for the automated 
creation of metadata rely primarily on content analysis which can be costly and inefficient. The 
automatic metadata generation system proposed in this article leverages resource relationships gen- 
erated from existing metadata as a medium for propagation from metadata-rich to metadata-poor 
resources. Because of its independence from content analysis, it can be applied to a wide variety 
of resource media types and is shown to be computationally inexpensive. The proposed method 
operates through two distinct phases. Occurrence and co-occurrence algorithms first generate an 
associative network of repository resources leveraging existing repository metadata. Second, using 
the associative network as a substrate, metadata associated with metadata-rich resources is propa- 
gated to metadata-poor resources by means of a discrete-form spreading activation algorithm. This 
article discusses the general framework for building associative networks, an algorithm for dissem- 
inating metadata through such networks, and the results of an experiment and validation of the 
proposed method using a standard bibliographic dataset. 



I. INTRODUCTION 

Resource metadata plays a pivotal role in the func- 
tionality and interoperability of digital information 
repositories. However, in spite of its value, high quality 
metadata is difficult to come by [B]. [5S] demonstrates 
that although as many as 15 possible metadata prop- 
erties can theoretically be included in the widely used 
Dublin Core standard [27], few are frequently used in
collections whose metadata are generally created by the
authors themselves [28]. The problem of poor and incomplete
metadata is expected to worsen as repositories
are applied to materials collected beyond the traditional, 
centralized methods of publication and start to obtain 
data from web pages, blogs, personal multimedia collec- 
tions, and collaborative tagging environments. 

Metadata is a costly resource to create, maintain, 
and/or recover manually. There has therefore been 
significant research on automated metadata generation 
(e.g. by extracting metadata from the content of resources).
Natural language processing [25] and document
image analysis techniques [Tj [TOl [171 [21] may extract 
keywords, subject categories, author, and citations 
(e.g. CiteSeer[2S]) from manuscripts. Furthermore, in 
[3], two metadata generators are demonstrated that 
successfully harvest and extract metadata from existing
resource source and content. Such content-based techniques
are much less efficient for multimedia resources,
e.g. video, music, images, and datasets. Reliable content 
analysis for such data is still an active research area and 
existing methods generally yield little content-related 
metadata. In addition, content-based approaches can be 
prohibitively expensive in computational terms [Tl] . 

For the reasons outlined above, methods for the gen- 
eration of metadata that do not rely on resource content 
have generated considerable interest. The recent growth 
in applications of "folksonomies" (i.e. community-based 
"tagging" [SI [IS]), has been, to some extent, inspired 
by the shortcomings of existing metadata generation 
methods. Unfortunately, human tagging only works well 
in situations where the number of participants greatly 
exceeds the number of resources to be tagged and where 
there is no requirement for controlled vocabularies or 
standardized metadata formats. 

In this article, we propose a system for automated 
metadata generation that starts from a common sce- 
nario: a heterogeneous repository contains resources 
for which varying degrees of metadata are available. 
Some resources have been imbued with rich, vetted 
metadata, whereas others have not. However, if it 
can be assumed that resources that are "similar" 
(e.g. similar in publication venue, authorship, date, 
citations, etc.) are more likely to have shared meta- 
data, then the problem of metadata generation can 
be reformulated as one of extrapolating metadata 
from metadata-rich to related, but metadata-poor re- 
sources. This article's experiment focuses on identifying 
which aspects of metadata similarity are best used to 
extrapolate resource metadata in a bibliographic dataset. 



As a case in point, one existing system supports
the annotation of personal photograph collections. Once
a user has annotated a photograph its metadata is au- 
tomatically transferred to photographs taken at similar 
times and locations. For example, a user photographs a 
group of friends at 3:45PM. Another photograph is made 
at 3:47PM. Since the second photograph was taken only 
two minutes after the first, it is likely that it depicts a 
similar scene. The system therefore transfers metadata 
from photograph 1 to photograph 2. Similarly, [2T] 
proposes a method of web page metadata propagation 
using co-citation networks. The general idea is that if 
two web pages cite other web pages in common, then the 
probability that they share similar metadata is higher. 
The user can later correct and augment any transferred 
metadata. 

The mentioned systems are strongly related to col- 
laborative filtering [11]. Collaborative filtering systems 
are commonly employed in online retail systems to rec- 
ommend items of interest to individual users. Using the 
principle that similar users are more likely to appreciate 
similar items, users are recommended items that are 
missing from their profiles but occur in the profiles of 
similar users. The collaborative filtering process can thus 
be regarded as an instance of metadata propagation. 
If users are considered resources and their profiles are 
considered "resource metadata", it can be said that
collaborative filtering systems "recommend" metadata 
from one resource to another based on resource similarity. 

A generalization of the above metadata propagation 
systems can be made in terms of the following elements: 

1. A mechanism to generate resource relations, i.e. as- 
sess their similarity. 

2. The determination of a metadata-rich subset of the 
repository's collection that can serve as a reference 
set. 

3. A means of propagating metadata from the 
metadata-rich reference set to a metadata-poor 
subset of the collection using the established re- 
source relations as a substrate. 

Such systems for the generation of metadata can be 
said to operate on a "Robin Hood" principle; they take 
from metadata-rich resources and give to metadata-poor 
resources, with the exception that metadata is not a 
zero-sum resource. This mode of operation has a number 
of desirable properties. First, it reduces the need for the 
costly generation of metadata; metadata is automatically 
extrapolated from an existing metadata-rich reference 
collection to a metadata-poor subset. Second, resource 
relations can be defined independent of content and 
metadata extrapolation can thus be implemented for 
a wide range of heterogeneous resources, e.g. audio, video,
and images. 



This article outlines a proposal for a metadata prop- 
agation system designed for scholarly repositories that 
takes advantage of the multiple means by which two 
resources can be related (e.g. co-citation, citation, co-author,
co-keyword, etc.). Figure 1 presents the outline
of the proposed system's components and processing 
stages. First, resource metadata is extracted from 
the collection of a repository. Second, an associative 
multi-relational network (i.e. a directed labeled graph) 
of resource relations is derived from a subset of the 
available metadata. Third, a metadata-rich subset of 
the collection is selected to serve as a reference data set. 
Fourth, and finally, metadata is propagated (i.e. extrap- 
olated) from the metadata-rich reference set to all other 
metadata-poor resources over the associative network 
of resources after which the repository is updated. 
Human validation can vet the results of the metadata 
extrapolation before insertion into the repository occurs. 



[Figure 1: diagram of the system's processing stages. Recoverable
labels: extract resource metadata; associative network; provide
associative network and existing resource metadata; propagate
metadata to metadata-poor resources; update metadata-poor resources]

FIG. 1: System outline

It is important to emphasize that this system requires 
the existence of some preliminary metadata both for 
the construction of resource relations and for metadata 
propagation. Furthermore, the quality or accuracy 
of the preliminary metadata is important in ensuring 
successful results (i.e. to avoid a "garbage in, garbage 
out" scenario). However, the metadata being propagated 
can be different from the metadata used to generate 
resource relations. For instance, in the manuscript 
domain, the propagation of keyword metadata may 
be most efficient along resource relations derived from 
citation metadata. Therefore, two aspects affect the 
efficiency of metadata propagation: the type of resource 
relations and the algorithm used to propagate metadata. 
It is important to note that no new metadata values
are created in the model proposed in this article. While it
is important for resources to maintain metadata, this
method only propagates pre-existing metadata values
and thus does not increase the discriminatory aspects
that metadata should, and generally does, provide. While
like resources should have similar metadata, variations 
should also exist to make sure that a resource's metadata 
accentuates the unique characteristics of the resource. 



This paper will first discuss two algorithms to define 
sets of resource relations and represent these relations in 
terms of associative networks. It will then formally define 
a metadata propagation algorithm which can operate on 
the basis of the generated resource relations. Finally, the 
proposed metadata generation system is validated using 
a modified version of the KDD Cup 2003 High-Energy 
Physics bibliographic dataset (hep-th 2003) [3D]. Because
the method requires an analysis of only the metadata of
resources, not their content, it could in theory be applied
to other resource types (e.g. video or audio). However, the
viability of its results in these other, untested domains
remains speculative.



II. CONSTRUCTING AN ASSOCIATIVE 
NETWORK OF REPOSITORY RESOURCES 



An associative network is a network that connects
resources according to some measure of similarity.
An associative network is represented by the data
structure $G = (N, E, W)$, where $N$ is the set of resources,
$E \subseteq N \times N$ is the set of directed relationships
between resources, and $W$ is the set of weight values
for all edges such that $|W| = |E|$. Any edge $e_{i,j,\mu}$
with corresponding weight $w_{i,j,\mu}$ expresses that there
exists a directed weighted relationship, constructed using
properties of type $\mu$, from resource $n_i$ to resource $n_j$.
The explicit representation of $\mu$ is necessary because
an associative network can be constructed according
to different properties (e.g. authorship, citations, or keywords).
As will be demonstrated, certain network
$\mu$ relationships are better (in terms of precision and
recall) at propagating certain property types than others.

The remainder of this section will describe two associative
network construction algorithms. One is based
on occurrence metadata, where a resource is considered
similar to another if there is a direct reference from
one resource to the other (e.g. a direct citation). The
other algorithm is based on co-occurrence metadata
and thus considers two resources to be similar if they
share similar metadata. That is, two resources are
deemed similar if the same metadata values occur in
both their properties (e.g. same authors, same keywords,
or same publication venue). Depending on how the
repository represents its metadata, some property types
will be direct reference properties and others will have
to be inferred through indirect, co-occurrence algorithms.

A. Occurrence Associative Networks

An associative network can be constructed if direct
references connect one resource to another. The World
Wide Web, for instance, is an associative network
based on occurrence data because a web-page makes a
direct reference to another web-page via a hyper-link
(i.e. the href HTML tag). For manuscript resources,
occurrence information usually exists in citations. For
instance, if resource $n_i$ references (i.e. cites) resource
$n_j$ then there exists an edge $e_{i,j,cite}$. One potential
algorithm for determining the edge weight is to first
determine how many citations resource $n_i$ currently
maintains. That is, if resource $n_i$ cites 50 resources
in total, then resource $n_i$ is $\frac{1}{50}$ as similar to
$n_j$, $w_{i,j,cite} = \frac{1}{50}$. Similarly, if resource $n_i$ only cites
resource $n_j$ then the strength of tie to resource $n_j$ is
greater, $w_{i,j,cite} = 1.0$. The general equation is defined
by Eq. II A, where the function meta($n_i$, cite) returns
the set of all citations for resource $n_i$. This equation
only holds if resource $n_j \in$ meta($n_i$, cite). Eq. II A
makes use of the $\mu$ notation in order to generalize the
equation for use with any direct reference property type.

$$w_{i,j,\mu} = \frac{1}{|\mathrm{meta}(n_i, \mu)|} \; : \; n_j \in \mathrm{meta}(n_i, \mu) \qquad \text{(Eq. II A)}$$

The running time of the algorithm to construct an
associative network based on direct, occurrence property
types is $O(N)$ since each resource must be checked
once and only once for direct reference to other resources.

B. Co-occurrence Associative Networks

Co-occurrence networks are created when resources
share the same metadata property values. For instance,
if two resources share the same keyword, author, or citation
values then there exists some degree of similarity.
For a co-occurrence network the edge weight for any
two resources, $w_{i,j,co\mu}$ and $w_{j,i,co\mu}$, is a function of the
number of metadata properties of type $\mu$ that $n_i$ and
$n_j$ share in common. A specific example of this could
be a co-keyword associative network created when two
resources have similar keywords. For example, suppose
the resource nodes $n_i$ and $n_j$ have the list of
keyword properties presented in Table I.

         keyword-1    keyword-2    keyword-3
n_i      repository   metadata     particle
n_j      images       repository   metadata

TABLE I: Keyword metadata for resources $n_i$ and $n_j$

In Table I, resources $n_i$ and $n_j$ share two keywords
in common, namely repository and metadata. The
edge weight between these two resources is a function
of the number of keywords they share in common
and the size of the keyword count of both
resources. Therefore, according to Eq. II B, the edges
connecting resource $n_i$ to $n_j$ and $n_j$ to $n_i$ have a weight
of $w_{n_i,n_j,cokey} = w_{n_j,n_i,cokey} = 0.5$.

$$co(n_i, n_j, \mu) = \mathrm{meta}(n_i, \mu) \cap \mathrm{meta}(n_j, \mu)$$

such that

$$w_{i,j,co\mu} = \frac{|co(n_i, n_j, \mu)|}{\left[\,|\mathrm{meta}(n_i, \mu)| + |\mathrm{meta}(n_j, \mu)|\,\right] - |co(n_i, n_j, \mu)|} \qquad \text{(Eq. II B)}$$

Notice that the co-occurrence algorithm in Eq. II B
returns a co$\mu$ representation. This means that for keyword
properties, the returned weight is a co-keyword similarity
weight. Similarly, for authorship metadata, the returned
weight is a co-authorship weight. The running time
of the algorithm to construct a co-occurrence network
is $O\left(\frac{N^2 - N}{2}\right)$ since each resource's $\mu$-properties must
be checked against every other resource's $\mu$-properties
($N^2$), except itself ($-N$), once and only once ($\frac{1}{2}$).
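The two edge-weighting schemes of this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `occurrence_edges` and `cooccurrence_edges`, the dictionary-of-sets layout, and the small citation example are assumptions made here. The keyword example reuses the values of Table I.

```python
from itertools import combinations


def occurrence_edges(meta):
    """Directed edges n_i -> n_j, each weighted 1 / |meta(n_i, mu)|.

    meta maps a resource to the set of resources it references directly.
    """
    edges = {}
    for n_i, refs in meta.items():
        for n_j in refs:
            edges[(n_i, n_j)] = 1.0 / len(refs)
    return edges


def cooccurrence_edges(meta):
    """Symmetric edges weighted |shared| / (|a| + |b| - |shared|).

    meta maps a resource to its set of property values of one type.
    Every unordered pair of resources is compared exactly once.
    """
    edges = {}
    for n_i, n_j in combinations(meta, 2):
        shared = meta[n_i] & meta[n_j]
        if shared:
            weight = len(shared) / (len(meta[n_i]) + len(meta[n_j]) - len(shared))
            edges[(n_i, n_j)] = edges[(n_j, n_i)] = weight
    return edges


# Keyword sets from Table I.
keywords = {
    "n_i": {"repository", "metadata", "particle"},
    "n_j": {"images", "repository", "metadata"},
}
cokey = cooccurrence_edges(keywords)

# Hypothetical citation data: a cites b and c, b cites c.
citations = {"a": {"b", "c"}, "b": {"c"}, "c": set()}
cite = occurrence_edges(citations)
```

For the Table I keywords the two resources share 2 of 4 distinct keywords, so both co-keyword edges receive the weight 0.5. In the citation example the edge from a to b has weight 0.5 while no edge from b to a exists, which illustrates that occurrence networks are directed.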



III. METADATA PROPAGATION ALGORITHM 

Reconstructing the metadata for a metadata-poor 
collection of resources is dependent not only on the 
associative network data structure, but also upon the 
use of a metadata propagation algorithm. The algorithm 
chosen is a derivative of the particle-swarm algorithm 
[25]. Particle-swarm algorithms are a discrete form of the
spreading activation algorithms [2] [3] [4] [5] [T^j. Because
particles are indivisible entities, it is easy to represent 
metadata properties as being encapsulated inside a 
particle. These metadata particles are then propagated 
over the edges of the associative network. Upon reaching 
a resource node that is missing a particular property 
type, the particle recommends its property value to the 
visited resource. This section will formally describe the 
metadata propagation algorithm before discussing the 
results of an experiment using a bibliographic dataset.

Every resource node in an associative network is
supplied with a single particle, $p_i \in P$, such that
$|P| = |N|$. The particle $p_i$ encapsulates all the metadata
properties of a particular resource $n_i$. Therefore,
meta($n_i$, $\mu$) = meta($p_i$, $\mu$) for all $\mu$. Particle $p_i$ has a
reference to its current node $c_i \in N$ such that at $t = 0$,
$c_i = n_i$. The particle $p_i$ begins its journey ($t = 0$) at
its home node, $n_i$, and traverses an outgoing edge of
$n_i$. Particle edge traversal is a stochastic process that
requires the outgoing edge weights of each node to form
a probability distribution. Therefore, the set of outgoing
edge weights of relation type $\mu$ for $n_i$, out($n_i$, $\mu$), must
be normalized: each outgoing weight is divided by the sum of
the node's outgoing weights,

$$w_{i,j,\mu} \leftarrow \frac{w_{i,j,\mu}}{\sum_{e_{i,k,\mu} \in \mathrm{out}(n_i,\mu)} w_{i,k,\mu}}$$

so that

$$\sum_{e_{i,j,\mu} \in \mathrm{out}(n_i,\mu)} w_{i,j,\mu} = 1.0$$

The function $\theta$(out($n_i$, $\mu$)) is defined such that it takes
the set of outgoing edges of relation type $\mu$ of node $n_i$ and
returns a single node $n_j$ based upon the outgoing edge
weight probability distribution, where $e_{i,j,\mu} \in$ out($n_i$, $\mu$).
This is how a particle traverses an associative network.
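The normalization and the stochastic node selection described above can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names `normalize` and `choose_next`, the dictionary layout, and the example weights are hypothetical and do not come from the paper.

```python
import random


def normalize(out_edges):
    """Scale a node's outgoing weights so that they sum to 1.0."""
    total = sum(out_edges.values())
    return {n_j: w / total for n_j, w in out_edges.items()}


def choose_next(out_edges, rng=random):
    """Pick one neighbour with probability equal to its normalized weight."""
    probs = normalize(out_edges)
    nodes = list(probs)
    return rng.choices(nodes, weights=[probs[n] for n in nodes], k=1)[0]


# Hypothetical outgoing edges of one node: {neighbour: raw weight}.
out_n1 = {"n2": 0.5, "n3": 1.5}
probs = normalize(out_n1)
nxt = choose_next(out_n1, random.Random(0))
```

Here n2 is selected with probability 0.25 and n3 with probability 0.75. Even when the raw weights are symmetric, as they are for co-occurrence edges, the normalized weights generally are not, because each node divides by its own outgoing total.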

The particle $p_i$ also has an associated energy value
$\epsilon_i \in [0, 1]$. Each time an edge is traversed, the particle $p_i$
decays its energy content, $\epsilon_i$, according to a global decay
value, $\delta \in [0, 1]$. Particle energy decay over discrete
time $t$ is represented in the equation below. The rationale for decay
is based on the intuition that the metadata property
values of a particular particle become less relevant the
further the particle travels away from its source node
($c_i$ at $t = 0$). Therefore, the further a particle travels in
the network, the more that particle's energy value (or
recommendation influence), $\epsilon_i$, is decayed.

$$\epsilon_i(t + 1) = (1 - \delta)\,\epsilon_i(t)$$
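Unrolling this recurrence gives a closed form. As a worked example, take the decay value $\delta = 0.15$ and the initial energy $\epsilon_i(0) = 1.0$ that the propagation pseudo-code assigns to each particle:

$$\epsilon_i(t) = (1 - \delta)^t \, \epsilon_i(0), \qquad \epsilon_i(1) = 0.85, \qquad \epsilon_i(2) = 0.85^2 = 0.7225 \approx 0.723$$

These are the energy values that appear in the Figure 2 example for particles that have traversed one and two edges.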

The energy value of a particle defines how much recommendation
influence a particle's metadata property
values have on a visited metadata-poor node. Each time
a particle traverses a node with missing metadata properties,
it not only recommends its metadata property
values to that node, but also increments the appropriate
property value with its current energy value $\epsilon_i$. In Figure
2 at $t = 0$, before the propagation algorithm has been
executed, resource $n_3$ has no keyword values. Therefore,
when particle $p_1$ reaches $n_3$ at $t = 1$, particle $p_1$ recommends
its keyword property values (keyword={swarm,
algorithms}) to node $n_3$ with an influence of $\epsilon_1 = 0.85$.
At $t = 2$, particle $p_2$, with $\epsilon_2 = 0.723$, recommends
its keyword property (keyword={swarm}) to node
$n_3$. Notice that the recommendation of the keyword
property value 'swarm' is reinforced each time that
property value is presented to $n_3$.

The function of a single particle, $p_i$, at a particular
node, $n_j$, is represented in pseudo-code in Algorithm 1,
where rec($n_j$, $\mu$) returns the set of property values
previously recommended to $n_j$ for a property of type $\mu$.



Note: Unlike Eq. II B for co-occurrence edges, the normalization
equations of this section do not guarantee that $w_{i,j,\mu} = w_{j,i,\mu}$.

FIG. 2: Particles recommending metadata information to a
metadata-poor node





Input: recommendMeta(n_j, p_i)
    # p_i updates the metadata of n_j for all property types
    foreach (μ-property) do
        # first ensure that n_j is metadata-poor at the particular μ-property
        if (|meta(n_j, μ)| == 0) then
            # update the metadata-poor node's μ-property with the μ property values of p_i
            foreach (x ∈ meta(p_i, μ)) do
                found = false
                # if property value already exists, increment its energy value with ε_i
                foreach (y ∈ rec(n_j, μ)) do
                    if (x == value(y)) then
                        energy(y) = energy(y) + ε_i
                        found = true
                    end
                end
                # if no recommended value exists, add to n_j's recommendations
                if (!found) then
                    addRec(x, ε_i)
                end
            end
        end
    end

Algorithm 1: Particle p_i recommending metadata
property values to n_j



If Algorithm 1 is called recommendMeta($n_j$, $p_i$), then
the full particle propagation algorithm can be described
by the pseudo-code in Algorithm 2. The process of
moving metadata particles through the associative network
and recommending metadata property values to
metadata-poor nodes continues until some desired $t$ is reached
or all particle energy in the network has decayed to 0.0, i.e.

$$\sum_{\forall p_i \in P} \epsilon_i = 0.0$$





Input: propagate(μ)
    # δ is a global energy decay value
    δ = 0.15
    # create a particle for each node
    foreach (n_i ∈ N) do
        meta(p_i, μ) = meta(n_i, μ) : ∀μ
        ε_i = 1.0
        c_i = n_i
    end
    # propagate metadata particles throughout the μ network
    t = 0
    while (Σ_{∀p_i ∈ P} ε_i > 0.0 && t < maxSteps) do
        foreach (p_i ∈ P) do
            # if c_i has no outgoing edges, freeze the particle
            if (|out(c_i, μ)| > 0) then
                c_i = θ(out(c_i, μ))
                ε_i = ε_i * (1 - δ)
                # do not recommend metadata to the particle's home node
                if (c_i != n_i) then
                    recommendMeta(c_i, p_i)
                end
            end
        end
        t = t + 1
    end

Algorithm 2: Propagating metadata particles
through an associative network of type μ
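A compact, runnable sketch of Algorithms 1 and 2 is given below. It is an illustration under stated assumptions rather than the authors' code: the function names, the nested-dictionary data layout, the fixed random seed, and the three-node example (modelled loosely on Figure 2) are all choices made here. For simplicity the loop stops after a fixed number of steps instead of also testing the total remaining energy.

```python
import random


def recommend_meta(node, particle, energy, meta, recs):
    """Algorithm 1: a particle recommends its values to a metadata-poor node."""
    for prop, values in meta[particle].items():
        if meta[node].get(prop):
            continue  # the node already has values for this property
        bucket = recs.setdefault(node, {}).setdefault(prop, {})
        for value in values:
            # reinforce an existing recommendation or create a new one
            bucket[value] = bucket.get(value, 0.0) + energy


def propagate(edges, meta, decay=0.15, max_steps=10, seed=0):
    """Algorithm 2: walk one particle per node and collect recommendations.

    edges: {node: {neighbour: weight}}
    meta:  {node: {property: set of values}}
    Returns {node: {property: {value: accumulated energy}}}.
    """
    rng = random.Random(seed)
    current = {n: n for n in meta}   # c_i: particle i starts at its home node
    energy = {n: 1.0 for n in meta}  # epsilon_i
    recs = {}
    for _ in range(max_steps):
        for p in meta:
            out = edges.get(current[p], {})
            if not out:
                continue  # no outgoing edges: the particle is frozen
            nodes = list(out)
            # random.choices treats weights as relative, so raw weights suffice
            current[p] = rng.choices(nodes, weights=[out[n] for n in nodes])[0]
            energy[p] *= 1.0 - decay
            if current[p] != p:  # never recommend to the particle's home node
                recommend_meta(current[p], p, energy[p], meta, recs)
    return recs


# Hypothetical three-node network: n2 -> n1 -> n3, where n3 has no keywords.
meta = {
    "n1": {"keyword": {"swarm", "algorithms"}},
    "n2": {"keyword": {"swarm"}},
    "n3": {"keyword": set()},
}
edges = {"n1": {"n3": 1.0}, "n2": {"n1": 1.0}}
recs = propagate(edges, meta)
```

Because every node in this example has at most one outgoing edge, the walk is deterministic. The particle from n1 reaches n3 after one step with energy 0.85, and the particle from n2 reaches n3 after two steps with energy 0.7225, so the value 'swarm' accumulates roughly 1.5725 while 'algorithms' stays at 0.85. Node n1 receives no recommendations because it already has keyword metadata.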



IV. AN EXPERIMENT USING THE 2003
HEP-TH BIBLIOGRAPHIC DATASET

This section will present the results of the proposed
metadata generation system when attempting to reconstruct
an artificially atrophied bibliographic dataset.
By artificially reducing the amount of metadata in the
full bibliographic dataset, it is possible to simulate a
metadata-poor environment and at the same time still be
able to validate the results of the metadata propagation
algorithm. The section is outlined as follows. First, the
dataset used for this experiment is described. Second, a
short review of the validation metrics (precision, recall,
and F-score) is presented. Third, the various system
parameters are discussed. Finally, the results of the
experiment are presented as a validation of the system's
use for manuscript-based digital library repositories.
Further research into domains other than manuscripts
is needed to establish the validity of this method for other
resource types.

The dataset used to validate the proposed system
is a modified version of the hep-th 2003 bibliographic
dataset for high energy physics and theory [19;.[3T|
A modified version of the hep-th dataset, as used in
[16], is represented as a semantic network containing
manuscripts (29,014), authors (12,755), journals (267),
organizations (963), keywords (40), and publication
date in year/season pairs (60). These nodes are then
connected according to the following semantics:

• writes(a,m): author a wrote manuscript m

• date_published(m,d): manuscript m was published
on date d

• organization_of(a,o): author a works for organization o

• published_in(m,j): manuscript m was published in
journal j

• cites(m_x,m_y): manuscript m_x cites manuscript m_y

• keyword_of(m,k): manuscript m has keyword k

For the purposes of this experiment, the semantic network
from [TIT was transformed into a list of manuscripts
and their associated metadata property name/value
pairs. These manuscript properties include: authors,
date of publication, citations, keywords, publishing
journal, and organizations. From the 29,014 manuscript
nodes, different occurrence and co-occurrence algorithms
were used to construct the following associative networks:



1. citation: manuscript $m_i$ maintains an edge to
manuscript $m_j$ if $m_i$ cites $m_j$ (27,240 edges)

2. co-author: manuscripts maintain an edge if they 
share authors (724,406 edges) 

3. co-citation: manuscripts maintain an edge if they 
share citations (23,089,616 edges) 

4. co-keyword: manuscripts maintain an edge if they 
share keywords (12,418,172 edges) 

5. co-organization: manuscripts maintain an edge if
they share organizations (33,947,083 edges) 

Though not explored empirically, it is worth noting 
that link prediction algorithms can be employed to 
resolve issues relating to edge sparsity in the network. 
In particular, the methods proposed in [TF and [13] are
such algorithms.



A. A Review of Precision, Recall, and F-Score 

The results of the metadata generation experiment are
evaluated according to the F-score measure. It is therefore
important to provide a quick review of precision,
recall, and F-score within the framework of the notation
presented thus far. For a particular property $\mu$, precision
is defined as the number of property values of type $\mu$
received that were relevant relative to the total number
of property values received overall. This is represented
in the equation below, where the function rec($n_i$, $\mu$) returns the set
of recommended property values for resource $n_i$ of type
$\mu$, while meta($n_i$, $\mu$) returns the set of property values
of type $\mu$ previously existing for resource $n_i$. Since the
validation is against an artificially atrophied resource
set, the recommended property values are checked
against the previously existing property values (prior to
artificial atrophy).

$$Pr(\mu) = \frac{|\mathrm{meta}(n_i, \mu) \cap \mathrm{rec}(n_i, \mu)|}{|\mathrm{rec}(n_i, \mu)|}$$



Recall, on the other hand, is defined as
the proportion of relevant property values received to
the total number of relevant property values possible,
as given in the equation below.
For example, if resource $n_i$ previously (before artificial
atrophy) had the property value keyword={swarm} and
is recommended the property value keyword={swarm},
then there is a 100% recall. On the other hand,
if resource $n_i$ previously had the property values
keyword={swarm, network} and is recommended the
property value keyword={swarm}, then there is a 50%
recall, whereas its precision is 100% in both cases.

$$Re(\mu) = \frac{|\mathrm{meta}(n_i, \mu) \cap \mathrm{rec}(n_i, \mu)|}{|\mathrm{meta}(n_i, \mu)|}$$



Precision and recall tend to be inversely related,
$Pr \propto \frac{1}{Re}$. This inverse relationship is understood
best when examining the extreme cases. If every
possible property value was provided to a resource
($|\mathrm{rec}(n_i, \mu)| \to \infty$), and that resource originally only had
one property value ($|\mathrm{meta}(n_i, \mu)| = 1$), then the recall
would be 100% while the precision would be near 0%.
At the opposite extreme, if a resource previously had
every possible property value in its original metadata
($|\mathrm{meta}(n_i, \mu)| \to \infty$) and was recommended only one
property value ($|\mathrm{rec}(n_i, \mu)| = 1$), then the precision
would be 100%, but the recall would be near 0%. While,
in some systems, precision and recall can be inversely
related, it is the goal of information retrieval systems
that are validated according to this criterion to achieve
both high precision and high recall values.



Finally, F-score, given in the equation below, can be used to combine
precision and recall into a single measure. Note that
different associative networks will perform differently
for different property types. For instance, co-citation
networks will, intuitively, perform better at propagating
keyword values than co-organization networks.
Therefore, the F-score measure will be represented as
F($\mu_x$, $\mu_y$) in order to express the F-score of a network
created from metadata properties of type $\mu_y$ propagating
property values of type $\mu_x$. Precision and recall can be
represented in a similar fashion, though the results of
the experiment to follow are expressed according to the
F-score measure only.

$$F(\mu_x, \mu_y) = \frac{2 \cdot Pr(\mu_x) \cdot Re(\mu_x)}{Pr(\mu_x) + Re(\mu_x)}$$
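The three measures can be computed directly from two sets: the property values a resource had before atrophy and the values recommended to it. The sketch below uses hypothetical function names, and returning 0.0 when a denominator would be zero is an assumption made here, not something the paper specifies.

```python
def precision(original, recommended):
    """Share of the recommended values that were in the original metadata."""
    if not recommended:
        return 0.0  # assumption: no recommendations counts as zero precision
    return len(original & recommended) / len(recommended)


def recall(original, recommended):
    """Share of the original metadata values that were recommended."""
    if not original:
        return 0.0  # assumption: nothing to recover counts as zero recall
    return len(original & recommended) / len(original)


def f_score(original, recommended):
    """Harmonic mean of precision and recall."""
    pr = precision(original, recommended)
    re = recall(original, recommended)
    if pr + re == 0.0:
        return 0.0
    return 2 * pr * re / (pr + re)


# The keyword example from the text: one of two original values is recovered.
original = {"swarm", "network"}
recommended = {"swarm"}
pr = precision(original, recommended)
re = recall(original, recommended)
f = f_score(original, recommended)
```

For this example the precision is 1.0 and the recall is 0.5, as stated in the text, which gives an F-score of 2/3.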



B. Experiment Parameters 

The experiment was set up to determine
various F-scores, F($\mu_x$, $\mu_y$), where
$\mu_x \in \{auth, cite, date, jour, key, org\}$ and $\mu_y \in
\{cite, coauth, cocite, cokey, coorg\}$. This means that
for every type of associative network generated, an
F-score for each metadata property type was determined.
Since the hep-th 2003 bibliographic dataset is
a metadata-rich dataset, it was necessary to destroy
a percentage of the metadata to test whether or not
the metadata generation algorithm could reconstruct
the property values for the selected metadata-poor
resources. Therefore, the tunable parameter, density,
$d \in [0.01, 0.99]$, was created. The density of the network
metadata ranges from 1% of the network resources
containing metadata to 99% of the resources. Given this
percentage parameter, resources were randomly selected
for atrophy before the metadata propagation algorithm
was run.





Input: experiment()
    # run the metadata propagation algorithm for each associative network type
    foreach (μ_y ∈ [coauth, cocite, cokey, cite]) do
        loadNetwork(μ_y)
        foreach (μ_x ∈ [auth, cite, date, jour, key, org]) do
            # atrophy a randomly selected percentage of the network
            for (d = 0.01, d < 1.0, d = d + 0.2) do
                killMeta(1 - d)
                propagateMeta(μ_x)
                # allow metadata-poor resources to accept only a certain
                # percentage of their recommended property values
                for (p = 0.0, p <= 1.0, p = p + 0.1) do
                    acceptMeta(p)
                    calculateF(μ_x, μ_y)
                end
            end
        end
    end

Algorithm 3: Determining the F-score for the
various experimental parameters



With the potential for 99% of the network containing
metadata, the propagation of metadata to the lacking
1% would be overwhelming (a high recall with a low precision).
In order to allow nodes to regulate the number
of metadata property values they accept, a second
parameter exists. The percentile parameter, $p \in [0, 1]$,
determines the energy threshold for property value
recommendations. Since each rec($n_i$, $\mu$) entry has an
associated energy value (recommendation influence), a
range is used from the 0th percentile, meaning all provided
property values are accepted, to the 100th percentile, meaning only
the top energy property value is accepted. The
pseudo-code for the experimental set-up is presented in
Algorithm 3. In Algorithm 3, killMeta(), acceptMeta(),
and calculateF() do not have accompanying pseudo-code.
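Since no pseudo-code is given for the acceptance step, the following is only one possible reading of the percentile threshold, written as a sketch. The function name `accept_meta`, the nearest-rank percentile rule, the treatment of ties, and the example energies are all assumptions made here. The rule is chosen so that p = 0 accepts every recommendation and p = 1 accepts only the highest-energy one.

```python
import math


def accept_meta(recs, p):
    """Keep recommended values whose energy reaches the p-th percentile.

    recs maps a recommended property value to its accumulated energy.
    p is the percentile in [0, 1]; ties at the threshold are all accepted.
    """
    if not recs:
        return set()
    energies = sorted(recs.values())
    # nearest-rank percentile, clamped to a valid index
    idx = min(len(energies) - 1, max(0, math.ceil(p * len(energies)) - 1))
    threshold = energies[idx]
    return {value for value, energy in recs.items() if energy >= threshold}


# Hypothetical keyword recommendations for one metadata-poor resource.
recs = {"swarm": 1.5725, "algorithms": 0.85, "network": 0.2}
accepted_all = accept_meta(recs, 0.0)
accepted_half = accept_meta(recs, 0.5)
accepted_top = accept_meta(recs, 1.0)
```

With these energies, p = 0.0 accepts all three values, p = 0.5 drops the weakest one, and p = 1.0 keeps only 'swarm'. Other threshold rules, such as a fraction of the maximum energy, would fit the description equally well.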

The general expected trend is that as the density
of the network increases, the recall increases and the
precision decreases. With more property values being
propagated, any metadata-poor record will, on average,
receive more recommendations than are needed. For
instance, a manuscript has only one publishing journal,
so a recommendation of 100 journals yields a very low
precision (0.01). To balance this effect, a percentile
increase will tend to increase the precision of the
algorithm at the expense of recall. When only the
highest energy recommendations are accepted, the
probability of rejecting a useful recommendation
increases. In the case of journal propagation, if only
the 100th percentile recommendation is accepted, then
only the highest energy recommendation is accepted. If
this journal recommendation is the correct publishing
venue, then there is 100% recall and precision. If not,
then there is 0% recall and precision. Depending on the
number of values needed to fill a particular property,
some p values will be more suitable than others.
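
The trade-off described above can be made concrete with a small sketch. The paper gives no pseudo-code for acceptMeta() or calculateF(), so the functions below are assumptions: `accept_meta` keeps the highest-energy fraction (1 - p) of the recommended values (never fewer than one), and `precision_recall_f` uses the standard balanced F-score. The journal names and energy values are hypothetical.

```python
import math

def accept_meta(recommendations, p):
    """Keep the top (1 - p) fraction of recommended values by energy.

    Assumed reading of the percentile parameter: p = 0.0 accepts every
    recommendation, p = 1.0 accepts only the single highest-energy one.
    """
    if not recommendations:
        return set()
    ranked = sorted(recommendations.items(), key=lambda kv: kv[1], reverse=True)
    # round() guards against floating-point noise such as 30.000000000000004
    keep = max(1, math.ceil(round((1.0 - p) * len(ranked), 9)))
    return {value for value, _energy in ranked[:keep]}

def precision_recall_f(accepted, truth):
    """Precision, recall and balanced F-score of accepted values."""
    correct = len(accepted & truth)
    precision = correct / len(accepted) if accepted else 0.0
    recall = correct / len(truth) if truth else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical data: 100 journal recommendations, journal_0 has the most energy.
recs = {f"journal_{i}": 100 - i for i in range(100)}
truth = {"journal_0"}  # a manuscript has exactly one publishing journal

for p in (0.0, 1.0):
    print(p, precision_recall_f(accept_meta(recs, p), truth))
```

With these assumed inputs, p = 0.0 accepts all 100 journals (precision 0.01, recall 1.0), while p = 1.0 accepts only the top journal, which here happens to be the correct one (precision and recall both 1.0).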



C. The Results 

This section presents the results of the experiment
outlined previously in Algorithm 3. For every associative
network type and for every metadata type, an F-score
matrix was determined for every combination of d (density)
and p (percentile). These F-score values were calculated
as the average over 20 different runs of the experiment.
Tables II and III provide the max and mean F-scores
for each network/metadata pair over the entire d/p set.
Note that the bold faced values are those $\mu_x$/$\mu_y$ pairs for
which a landscape plot is provided. The italicized values
are experimental anomalies, since the metadata used to
generate the associative network is the same metadata
being propagated. For all other combinations, metadata
of a particular $\mu$ type is used to create an associative
network, and metadata properties of a different $\mu$ type
are propagated over those edges. For instance, a
co-authorship network is used to propagate citation
property values.
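
As a rough illustration of how the max and mean entries for one network/metadata pair could be derived, the sketch below averages several per-run F-score matrices (rows indexed by d, columns by p) and then summarizes the averaged matrix. The function names and the tiny 2x2 matrices are hypothetical; the experiment itself used 20 runs over the full d/p grid.

```python
from statistics import mean

def average_runs(runs):
    """Element-wise mean of several F-score matrices of equal shape."""
    n_rows, n_cols = len(runs[0]), len(runs[0][0])
    return [
        [mean(run[i][j] for run in runs) for j in range(n_cols)]
        for i in range(n_rows)
    ]

def summarize(matrix):
    """Return (max, mean) over every d/p cell of an F-score matrix."""
    cells = [value for row in matrix for value in row]
    return max(cells), mean(cells)

# Hypothetical F-scores from two runs (rows: two d values, columns: two p values).
run_1 = [[0.25, 0.5], [0.5, 1.0]]
run_2 = [[0.75, 0.5], [0.0, 0.5]]

averaged = average_runs([run_1, run_2])
max_f, mean_f = summarize(averaged)
print(averaged, max_f, mean_f)
```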

The following landscape plots expose the relationship 
between d and p. A short explanation of the intuition 
behind each plot is also provided. 

Intuitively, it makes sense that a co-authorship
network would perform well when propagating citation,
journal, keyword, and organization property values,
which are represented in Figure 3a, Figure 3b, Figure
4a, and Figure 4b, respectively. This performance is a
result of the fact that collaborating authors tend to cite
themselves, publish in similar journals, write about
similar topics, and work within similar organizations.
Notice the effect that percentile (p) has on Figure 3a
as opposed to Figure 3b. Since there tend to exist



network/metadata   author   citation   date     journal   keyword   organization
citation           0.1829   0.1757     0.0606   0.2438    0.3913    0.2782
co-author          0.6218   0.1300     0.0717   0.2630    0.2795    0.6457
co-citation        0.0770   0.1821     0.0780   0.2081    0.2213    0.1350
co-keyword         0.0073   0.0248     0.0472   0.1904    0.8689    0.0420
co-organization    0.0709   0.0236     0.0508   0.1918    0.1180    0.5000



TABLE II: Max F-scores 



network/metadata   author   citation   date     journal   keyword   organization
citation           0.1367   0.1327     0.0441   0.2133    0.3246    0.2218
co-author          0.28^8   0.0780     0.0548   0.2004    0.1958    0.3935
co-citation        0.0338   0.0697     0.0539   0.1554    0.1509    0.0768
co-keyword         0.0032   0.0160     0.0385   0.1468    0.3240    0.0330
co-organization    0.0312   0.0145     0.0392   0.1410    0.0909    0.1554



TABLE III: Mean F-scores 




FIG. 3: Co-authorship network propagating a. citation F(cite, coauth) and b. journal F(jour, coauth) metadata



many citation property values (manuscripts cite many
manuscripts), lower percentile values ($p \approx 0$) ensure
that there is a high recall. When $p = 1.0$, only the
top citation is accepted and therefore the F-score drops
(very poor recall). On the other hand, in Figure 3b,
when $p = 0.0$, there are many journal recommendations.
This is not desirable since a journal property only has
one value (a manuscript is published in only one venue).
Therefore, at $p = 1.0$, only one journal value is accepted
into the resource's journal property. In situations where
few property values are expected, the F-score is best
with a high p.

A co-citation network (Figure 5) performs best with
journal and keyword properties. This means that
manuscripts are likely to cite other manuscripts with
similar journal venues and, since citation tends to be
within the same subject domain, the probability of
similar keyword metadata increases. Again, notice the
effect of p on journal metadata propagation. The shape
of the Figure 5a graph nearly mimics the shape of Figure
3b. The same holds for Figure 5b and Figure 4a. Again,
the expected number of property values is a major factor
in determining the system's p parameter.

A citation network, like a co-citation network, performs
well for author, journal, keyword, and organizational
properties (Figure 6 and Figure 8). It is interesting to
note how much better a citation network works for
$p \approx 0.0$. Since a citation network is not symmetric,
there is a chance that a particle will reach a dead
end. When a particle reaches a dead end, it no longer
recommends property values. Furthermore, citations are
in a hierarchy with more recent publications being at the
top of the hierarchy (manuscripts cannot cite forward
in time). Particles therefore trickle down the hierarchy
via a single, non-recurrent path from top to bottom.
This "plinko ball" effect is represented in Figure 7. The
lack of recurrence in citation networks tends to produce
a high precision with a lower recall. High precision and
low recall is exactly what a high p produces. Therefore,
since the topology of the citation network yields the
same effect, the effect of p as $p \to 0.0$ is not as
pronounced.
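
The dead-end behaviour described above can be illustrated with a minimal sketch. It is not the propagation algorithm of this article (energy decay and recommendation bookkeeping are omitted); it only shows that a particle following directed citation edges in an acyclic network never revisits a node and always stops at a manuscript that cites nothing. The graph, the node names, and the `walk` function are hypothetical.

```python
import random

def walk(citations, start, rng, max_steps=1000):
    """Follow randomly chosen outgoing citation edges until a dead end."""
    path = [start]
    node = start
    for _ in range(max_steps):
        cited = citations.get(node, [])
        if not cited:  # dead end: this manuscript cites nothing in the network
            break
        node = rng.choice(cited)
        path.append(node)
    return path

# Hypothetical citation network: newer manuscripts cite only older ones.
citations = {
    "m2003": ["m2002", "m2001"],
    "m2002": ["m2001", "m2000"],
    "m2001": ["m2000"],
    "m2000": [],
}

rng = random.Random(0)
paths = [walk(citations, "m2003", rng) for _ in range(50)]
print(paths[0])
```

Whatever path the random choices produce, it is strictly descending in time, so no particle can return to a node it has already visited.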



As can be noticed from Table II, Table III, and Figure
8a, the keyword property performs best in a citation
network. A direct reference from one document to
another is a validation of the similarity between documents
with respect to subject domain. Therefore, the tendency
for citing documents to contain similar keyword values
is high. For instance, refer to the citations of this article
(references in this manuscript's bibliography). Every
cited manuscript is either about automatic metadata
generation, bibliographic networks, or network analysis.

FIG. 4: Co-authorship network propagating a. keyword F(key, coauth) and b. organization F(org, coauth) properties

FIG. 5: Co-citation network propagating a. journal F(jour, cocite) and b. keyword F(key, cocite) properties




FIG. 6: Citation network propagating a. author F(auth, cite) and b. journal F(jour, cite) properties 



A co-keyword network does not perform well for most
property types except the journal property (Figure 9a).
This makes sense since manuscripts on similar topics are
likely to be published in similar journals.



V. FUTURE WORK 

This paper has provided a preliminary exploration 
of metadata generation in terms of metadata property 
propagation within an associative network of repository 
resources. Further research in this area may prove 
useful for other network types such as those generated 



from other metadata properties. For instance, it may 
be of interest to study the effect of this algorithm on 
usage networks [T. Usage metadata, unlike citation 
and journal metadata, is applicable to every accessible 
resource. It would be interesting to see what co-usage
means for a particular genre of resources by determining
which metadata properties these networks are best at
propagating.

A variety of propagation algorithms may also be
explored. The present method assumes that a particle
takes only edges of a particular $\mu$ type for the duration
of its life-span. Different path types might be an important
aspect of increasing the precision and recall performance
of this method. For instance, keyword metadata that

FIG. 7: Citation networks are non-recurrent networks



first propagates over co-authorship edges and then over 
co-citation edges might provide better results. Methods 
to implement such propagation algorithms have been 
presented in [211 US]. Also, different edge types can 
be merged such that all co-keyword and co-authorship 
edges are collapsed to form a single edge. 
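
One way to picture the edge-merging idea is the sketch below, which collapses all edges of the selected types between the same pair of resources into a single edge. Summing the weights is an assumption made here for illustration; the text does not specify how merged weights would be combined. The edge data and the `merge_edge_types` function are hypothetical.

```python
from collections import defaultdict

def merge_edge_types(typed_edges, types_to_merge):
    """Collapse edges of the given types into one edge per node pair.

    typed_edges maps (source, target, edge_type) to a weight. Weights of
    merged edges are summed (an illustrative choice, not the paper's).
    """
    merged = defaultdict(float)
    for (source, target, edge_type), weight in typed_edges.items():
        if edge_type in types_to_merge:
            merged[(source, target)] += weight
    return dict(merged)

# Hypothetical typed associative network.
typed_edges = {
    ("a", "b", "co-keyword"): 0.5,
    ("a", "b", "co-author"): 0.25,
    ("b", "c", "co-citation"): 1.0,
}

merged = merge_edge_types(typed_edges, {"co-keyword", "co-author"})
print(merged)
```

In this example the co-keyword and co-authorship edges between a and b become one edge, while the co-citation edge is left out of the merged network.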

This study has presented the results of this algorithm
without the intervention of any human components
(besides the initial creation of metadata through the
hep-th dataset creation process). Future work that
includes humans who help to validate and "clean" the
recommended metadata would show how much this method
can speed up the process of generating accurate and
reliable metadata for metadata-poor resources. Such an
analysis is left to future research.

Finally, multiplicative effects due to particle interaction
may affect the results of the algorithm. For instance,
if two particles, $p_i$ and $p_j$, meet at a particular node,
$n_k$, and $p_i$ and $p_j$ have similar metadata, then the
footprint they leave at $n_k$ should be more noticeable.
Because two different metadata sources are supplying the
same property values, there is an increased probability of
that recommended metadata value being correct. Currently,
only a summation is being provided. It may be
interesting to multiply this summation by the number
of unique particles that provided energy for a particular
recommended metadata value. The variations of this
preliminary framework will be explored in future work.
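
The proposed variation can be sketched as follows, assuming each deposit at a node is recorded as a (particle, property value, energy) triple. The record layout, the function name, and the example values are hypothetical; the sketch only contrasts the plain summation with the summation multiplied by the number of unique contributing particles.

```python
from collections import defaultdict

def score_recommendations(deposits):
    """Return (summed, boosted) energy per recommended property value.

    summed:  plain sum of deposited energy (the current behaviour).
    boosted: that sum multiplied by the number of unique particles that
             contributed energy for the value (the suggested variation).
    """
    total = defaultdict(float)
    sources = defaultdict(set)
    for particle_id, value, energy in deposits:
        total[value] += energy
        sources[value].add(particle_id)
    summed = dict(total)
    boosted = {value: total[value] * len(sources[value]) for value in total}
    return summed, boosted

# Hypothetical deposits left at one node n_k by two particles.
deposits = [
    ("p_i", "keyword_a", 0.5),
    ("p_j", "keyword_a", 0.5),
    ("p_i", "keyword_b", 1.0),
]

summed, boosted = score_recommendations(deposits)
print(summed, boosted)
```

Under plain summation the two keywords tie at 1.0; with the multiplier, the keyword supplied by two independent particles scores 2.0 and ranks first.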



VI. CONCLUSION 

Automatic metadata generation is becoming an 
increasingly important field of research as digital library 
repositories become more prevalent and move into the 



arena of less strongly controlled, decentralized collections 
(e.g. arXiv and CiteSeer). The creation and mainte- 
nance of high-quality, detailed metadata is hampered on 
numerous levels. Manual metadata creation methods are 
costly. Recent efforts to leverage the collective power of
social tagging (i.e., "folksonomies") may address some of
the shortcomings of the manual creation of metadata and
result in viable models for online resources that do not
require strongly controlled vocabularies and metadata
ontologies. However, it is doubtful that "folksonomies"
can be generalized to situations that require vetted,
well-standardized metadata. The automated creation of
metadata on the basis of content analysis is a promising
alternative to the manual creation of metadata. It is
conceivably more efficient in situations where textual
data is available and allows for more formal control
of the type and nature of metadata that is extracted.
However, it can be unreliable for non-text resources,
can yield low-quality metadata, and can be
computationally expensive.

This article proposed another possible component of
the metadata generation toolkit which may complement
and support the above-mentioned approaches. Instead
of creating new metadata, metadata is propagated from
a metadata-rich subset of the collection to similar, but
metadata-poor, subsets. The substrate for this
extrapolation is an associative network of resource
relations created from other available metadata. Metadata
propagation may provide a computationally feasible means
of generating large amounts of metadata for heterogeneous
resources, which can later be fine-tuned by manual
intervention or cross-validation with content-based methods.
Finally, the article provided experimental results using
the High-Energy Physics bibliographic data set (hep-th
2003). Human intervention may play an important role
in fine-tuning the metadata propagation algorithm. The
results of this experiment are promising, and there still
exists a range of potential modifications to this basic
framework that may lead to even better results.

FIG. 8: Citation network propagating a. keyword F(key, cite) and b. organization F(org, cite) properties

FIG. 9: Co-keyword and co-organization networks propagating a. journal F(jour, cokey) and b. F(jour, coorg) properties, respectively

README.txt) of the data set and a few specifics
are quoted here: "Object and link properties such as
title, authors, journal (if published), and various dates
were extracted from the abstract files with additional
date information coming from the slacdates citation
tarball. Because institutions were often not presented in
a consistent format, the email domain of the submitter
(if available) was used as a surrogate for institution.
Because many authors had no associated email address,
domain information is not available for all authors.
Consolidation was performed on journal names, domains,
and author names. A nominal amount of hand-cleaning
to correct spelling or formatting problems was also
performed."



